Tabular Data

DSST 289: Introduction to Data Science

Erik Fredner

2024-08-28

Homework check-in

  1. Received Blackboard announcement?
  2. Read the notes for today?
  3. Installed R, RStudio, and the tidyverse?
  4. Completed questions at the end of the notes?
  5. Uploaded completed questions to Blackboard?

Tabular Data

  • Tabular data is a common data structure.
  • It is a table (“table” → “tabular”) with rows and columns.
  • Rows contain observations.
  • Columns contain variables (which are also occasionally called features).

Excel is tabular

[Figure: table with dog breed, height, and weight]

Observations (dogs) go in rows

[Figure: dogs table with one row highlighted]

Variables go in columns

[Figure: dogs table with the breed column highlighted]

tibbles are tabular

# `tibble()` comes from the tidyverse, so load it first
library(tidyverse)

breed <- c("Shih-Tzu", "Labrador", "Beagle", "Newfoundland", "Chihuahua", "Affenpinscher")
weight <- c(5.5, 33, 10.2, 70, 1.3, 9.6)
height <- c(24, 56, 34, 69, 20, 27)
dogs <- tibble(breed, weight, height)
dogs
# A tibble: 6 × 3
  breed         weight height
  <chr>          <dbl>  <dbl>
1 Shih-Tzu         5.5     24
2 Labrador        33       56
3 Beagle          10.2     34
4 Newfoundland    70       69
5 Chihuahua        1.3     20
6 Affenpinscher    9.6     27
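
The header line `# A tibble: 6 × 3` reports the table's shape: 6 observations (rows) by 3 variables (columns). A minimal sketch of how you could check this yourself, assuming the tidyverse is installed and rebuilding `dogs` with the same values so the example runs on its own:

```r
library(tidyverse)

# Same values as the `dogs` tibble on this slide
dogs <- tibble(
  breed = c("Shih-Tzu", "Labrador", "Beagle", "Newfoundland", "Chihuahua", "Affenpinscher"),
  weight = c(5.5, 33, 10.2, 70, 1.3, 9.6),
  height = c(24, 56, 34, 69, 20, 27)
)

nrow(dogs)  # number of observations (rows)
ncol(dogs)  # number of variables (columns)
names(dogs) # the variable names
```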

One variable: breed

dogs |>
  # `select()` says "select these columns."
  select(breed)
# A tibble: 6 × 1
  breed        
  <chr>        
1 Shih-Tzu     
2 Labrador     
3 Beagle       
4 Newfoundland 
5 Chihuahua    
6 Affenpinscher

One observation: "Beagle"

dogs |>
  filter(breed == "Beagle")
# A tibble: 1 × 3
  breed  weight height
  <chr>   <dbl>  <dbl>
1 Beagle   10.2     34
  • == is a comparison operator that means “is equal to.”
  • We’re saying, “Filter this tibble for rows where breed is equal to "Beagle".”
  • Note the distinction between variables and observations.
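
`filter()` and `select()` can also be combined: one picks rows (observations), the other picks columns (variables). The sketch below is an illustration rather than part of the original slides. It assumes the tidyverse is installed, rebuilds `dogs` so it runs on its own, and uses a hypothetical object name, `heavy`. It also uses `>`, a comparison operator covered later in these slides.

```r
library(tidyverse)

# Same values as the `dogs` tibble from the earlier slide
dogs <- tibble(
  breed = c("Shih-Tzu", "Labrador", "Beagle", "Newfoundland", "Chihuahua", "Affenpinscher"),
  weight = c(5.5, 33, 10.2, 70, 1.3, 9.6),
  height = c(24, 56, 34, 69, 20, 27)
)

# `heavy` is a hypothetical name chosen for this example
heavy <- dogs |>
  # keep only the rows (observations) where weight is greater than 10
  filter(weight > 10) |>
  # keep only these two columns (variables)
  select(breed, weight)

heavy
```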

Data types

  • Every variable (column) has one and only one data type.
  • Two common data types in the dogs table:
    • character (text, e.g., "Beagle")
    • numeric (numbers, e.g., 10.2)

Data types in dogs

dogs
# A tibble: 6 × 3
  breed         weight height
  <chr>          <dbl>  <dbl>
1 Shih-Tzu         5.5     24
2 Labrador        33       56
3 Beagle          10.2     34
4 Newfoundland    70       69
5 Chihuahua        1.3     20
6 Affenpinscher    9.6     27
  • <chr> is short for character
  • <dbl> is short for double (a double-precision number), which is one kind of numeric
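
You can ask R for the type of a vector directly. The sketch below uses only base R and shortened copies of the `breed` and `weight` vectors; `mixed` is a hypothetical name added for illustration. It also shows what "one and only one data type" means in practice: if text and numbers are combined in one vector, R converts everything to character.

```r
breed <- c("Shih-Tzu", "Labrador", "Beagle")
weight <- c(5.5, 33, 10.2)

class(breed)   # "character"
class(weight)  # "numeric"
typeof(weight) # "double"

# Mixing text and numbers forces a single type: the number becomes text
mixed <- c("Beagle", 10.2)
class(mixed)   # "character"
```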

Tabular data takeaways

  • Good data structure is the most important thing we will learn in this class.
  • Real-world data science is largely about gathering, cleaning, and organizing data.
    • Working data scientists spend most of their time on these tasks, not on modeling.
  • Don’t take good data for granted!

Introduction to some key concepts in R

Objects

  • Everything in R is an object.
  • We usually create new objects by assigning values to names:
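# each assignment overwrites the previous value of almost_pi
almost_pi <- 3.14
almost_pi <- 3.1415
almost_pi <- 3.141592653589793238462643383279
# R prints 7 significant digits by default, so the displayed value is rounded: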
almost_pi
[1] 3.141593
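
The rounding above affects only how the value is displayed, not what is stored. A small sketch, assuming R's default print settings, of how to ask for more digits:

```r
almost_pi <- 3.14
# Assigning to the same name again replaces the old value
almost_pi <- 3.141592653589793238462643383279

# Default printing shows 7 significant digits
almost_pi

# Ask for more digits when printing or formatting
print(almost_pi, digits = 10)
format(almost_pi, digits = 10)
```

Note that a double holds only about 15 to 16 significant digits, so the extra digits typed in the assignment are not all retained.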

Functions

Functions take inputs and generate outputs.

round() is a function that rounds a number (or each number in a vector of numbers).

round(9.5)
[1] 10
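
The same function works on a whole vector at once. As a sketch, here it is applied to the dog weights from earlier; `rounded` is a hypothetical name used only in this example:

```r
weight <- c(5.5, 33, 10.2, 70, 1.3, 9.6)

# round() is applied to every element of the vector
rounded <- round(weight)
rounded
```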

Comparison operators

Comparison operators compare two values and return either TRUE or FALSE.

# Is the rounded value of 9.4 equal to 10?
round(9.4) == 10
[1] FALSE
# Is the rounded value of 9.4 less than 10?
round(9.4) < 10
[1] TRUE
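
Comparison operators also work element by element on vectors, which is what makes them useful inside `filter()`. A short base R sketch with a few more operators (the values are a shortened copy of the dog weights):

```r
weight <- c(5.5, 33, 10.2)

weight > 10   # greater than: FALSE TRUE TRUE
weight != 33  # not equal to: TRUE FALSE TRUE
weight <= 5.5 # less than or equal to: TRUE FALSE FALSE

# Text comparisons are case-sensitive
"Beagle" == "beagle" # FALSE
```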

Functions take arguments

  • Arguments may be named. If unnamed, arguments are matched by position.
  • If named, they may be given in any order.
# here, we're rounding the almost_pi object
round(almost_pi)
[1] 3
round(almost_pi, digits = 6)
[1] 3.141593
# this is the same as above because digits is the second argument of round()
round(almost_pi, 6)
[1] 3.141593
# this is the same as above, but because we name x, we can invert the order:
round(digits = 6, x = almost_pi)
[1] 3.141593
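
Position matters when arguments are unnamed. A sketch using base R's `log(x, base)` to show what can go wrong when the order is swapped; the object names are hypothetical:

```r
# Named arguments: order does not matter
by_name <- log(x = 8, base = 2)
reordered <- log(base = 2, x = 8)

# Unnamed arguments are matched by position: x first, then base
by_position <- log(8, 2)

# Swapping unnamed arguments changes the question being asked:
# this is the log of 2 in base 8, not the log of 8 in base 2
swapped <- log(2, 8)

by_name
swapped
```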

Pipes

  • Pipes |> chain functions together.
  • They pass the output of one function to the first input (often x) of the next function.
  • They improve readability and reduce the need for intermediate objects.

Why are pipes |> useful?

# this is hard to read
# you evaluate expressions from the innermost to the outermost
abs(tan(log(exp(8), base = 2)))
[1] 1.645831
# this is annoying to write
temp <- exp(8)
temp <- log(temp, base = 2)
temp <- tan(temp)
temp <- abs(temp)
temp
[1] 1.645831
# pipes are easier to read and easier to write
8 |>
  exp() |>
  log(base = 2) |>
  tan() |>
  abs()
[1] 1.645831
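
Put another way, `x |> f(y)` is another way of writing `f(x, y)`. A small sketch to check that the nested and piped forms give the same result, assuming R 4.1 or later (where the native pipe `|>` was introduced); `nested` and `piped` are hypothetical names:

```r
almost_pi <- 3.141593

# nested call: read from the inside out
nested <- round(sqrt(almost_pi), digits = 2)

# piped call: read from top to bottom
piped <- almost_pi |>
  sqrt() |>
  round(digits = 2)

identical(nested, piped)
```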

Formatting code

  • Well-formatted code is not just nice.
    • It’s essential when you share your code with others or need to read it later.
  • The formatting guidelines we use are not my opinions.
  • Some rules I will enforce:

Spacing around operators

Put spaces before and after operators.

# bad spacing is hard to read
bad<-1+2/3

# good spacing
good <- 1 + 2 / 3

But as in English prose, no space before a comma:

# bad spacing is unnatural
bad <- round(almost_pi , 2)

# good spacing
good <- round(almost_pi, 2)

Pipe spacing

Pipes |> need a space before them and a line break after them, with the following lines indented:

# bad spacing
8|>exp()|>log(base=2)|>tan()|>abs()
[1] 1.645831
# good spacing: note the indentation (two spaces) on each line after the first pipe
8 |>
  exp() |>
  log(base = 2) |>
  tan() |>
  abs()
[1] 1.645831

Naming stuff

Bad object names make code hard for you and your readers to understand.

Rules for names:

  1. Use lowercase letters, numbers, and underscores.
  2. Use snake case (e.g., snake_case).
  3. Write the shortest, clearest name you can.

Naming stuff: examples

  • Horrible: WeightOfDogInKilograms
  • Bad: weight_of_dog_in_kg
  • Okay: dog_weight
  • Best: weight

The last and best option is only available if your data is structured correctly!
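
One way to read this point, sketched under the assumption that the tidyverse is installed: when each dog's weight lives in its own object, the name has to carry the breed. When the data is tabular, the breed lives in its own column, so the column can simply be called `weight`. The object names below are hypothetical.

```r
library(tidyverse)

# Without a table: the object name has to say which dog this is
beagle_weight <- 10.2
labrador_weight <- 33

# With a table: breed is a variable, so `weight` is a clear enough name
dogs <- tibble(
  breed = c("Beagle", "Labrador"),
  weight = c(10.2, 33)
)

beagle <- dogs |>
  filter(breed == "Beagle") |>
  select(weight)

beagle
```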

Running code in .Rmd files

There are many ways to run code in an .Rmd file:

  • Click the green play button in the RStudio script editor.
  • Use the Command Palette (Cmd + Shift + P on macOS, Ctrl + Shift + P on Windows) and search for “Run the current code chunk.”
  • Keyboard shortcuts: Ctrl + Shift + Enter (Windows) or ⌘ + Shift + Enter (macOS)
  • Click the Knit button to render the entire document as an HTML file.

In-class practice

  1. Go to Course Documents in Blackboard.
  2. Download the notebook (.Rmd) file from the folder 01-Tabular Data.
  3. Move that file to the nb directory in your DSST289 folder:
    • DSST289/nb/notebook01.Rmd
  4. Open that file in RStudio and try the problems.